
Brian Gerber and Marc Kéry






“We are rarely explicit about the goals of our models…though should be!”
Is my model good?
Inference vs prediction

\[ \hat{p} \pm t_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}\,\frac{N-n}{N-1}} \]
Probability as the basis for statistical inference
The world is uncertain; processes are often not perfectly known (i.e., they are not deterministic)
Need to draw conclusions, make decisions, or learn from observations in the face of resulting uncertainty
Probability: branch of mathematics dealing with chance processes and their outcomes

Connect random variables, probabilities, and parameters
define prob. functions
use/plot prob. functions
notation!
Probability and statistics are the opposite sides of the same coin.
To understand statistics, we need to understand probability and probability functions.
The two keys to understanding this connection are the random variable (RV) and the parameters (e.g., \(\theta\), \(\sigma\), \(\epsilon\), \(\mu\)).
Why learn about RVs and probability math?
Foundations of:
Our Goal:
\[ \begin{align*} a &= 10 \\ b &= \log(a) \times 12 \\ c &= \frac{a}{b} \\ y &= \beta_0 + \beta_1\times c \end{align*} \]
All variables here are scalars: they are what they are, and that is it.
Scalars are quantities that are fully described by a magnitude (or numerical value) alone.
\[ y \sim f(y) \]
\(y\) is a random variable which may change values each observation; it changes based on a probability function, \(f(y)\).
The tilde (\(\sim\)) denotes “has the probability distribution of”.
Which value of \(y\) is observed is not perfectly predictable; to describe how likely each value is, we need to know the parameters (\(\theta\)) of the probability function \(f(y)\).
\[ y \sim f(y|\theta) \]
where ‘|’ is read as ‘given’.
Toss of a coin
Roll of a die
Weight of a captured elk
Count of plants in a sampled plot
\[ y \sim f(y) \]
The values observed can be understood in terms of their frequencies within the population, or a presumed super-population. These frequencies can be described by probabilities.

We often only get to see ONE sample from this distribution.

We are often interested in the characteristics of the whole population.
We infer what these are from our sample (i.e., statistical inference).
Frequentist Paradigm:
Data (e.g., \(y\)) are random variables that can be described by probability distributions with unknown parameters (e.g., \(\theta\)) that are fixed (scalars).
Bayesian Paradigm:
Data (e.g., \(y\)) are random variables (when observed, then fixed) that can be described by probability functions where the unknown parameters (e.g., \(\theta\)) are also random variables that have probability functions that describe them.
\[ \begin{align*} y &= \text{ event/outcome} \\ f(y|\boldsymbol{\theta}) &= [y|\boldsymbol{\theta}] = \text{ process governing the value of } y \\ \boldsymbol{\theta} &= \text{ parameters} \\ \end{align*} \]
\(f()\) or [ ] denotes a (mathematical) function.
It is called a PDF when \(y\) is continuous and a PMF when \(y\) is discrete.
We commonly use deterministic functions (indicated by a non-italic letter), e.g., log(), exp(). The output is always the same with the same input. \[ \hspace{-12pt}\text{g} \\ x \Longrightarrow\fbox{DO STUFF } \Longrightarrow \text{g}(x) \]
\[ \hspace{-14pt}\text{g} \\ x \Longrightarrow\fbox{+7 } \Longrightarrow \text{g}(x) \]
\[ \text{g}(x) = x + 7 \]
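A minimal sketch of this deterministic function in R, mirroring the slide's g:

```r
# Deterministic function: identical input always yields identical output
g <- function(x) x + 7
g(3)   # 10, every time
```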
Probability: Interested in \(y\), the data, and the probability function that “generates” the data. \[ y \leftarrow f(y|\boldsymbol{\theta}) \]
Statistics: Interested in population characteristics of \(y\); i.e., the parameters,
\[ y \rightarrow f(y|\boldsymbol{\theta}) \]
Probability functions are special functions with rules that guarantee the logic of probability is maintained.
\(y\) can only take a certain set of values.
These sets are called the sample space (\(\Omega\)) or the support of the RV.
\[ f(y) = P(Y=y) \]
Data has two outcomes (0 = dead, 1 = alive)
\(y \in \{0,1\}\)
There are two probabilities
Axiom 1: The probability of an event is greater than or equal to zero and less than or equal to 1.
\[ 0 \leq f(y) \leq 1 \] Example,
Axiom 2: The sum of the probabilities of all possible values (sample space) is one.
\[ \sum_{\forall i} f(y_i) = f(y_1) + f(y_2) + ... = P(\Omega) =1 \] Example,
Still need to define \(f()\), our PMF for \(y \in \{0,1\}\)
\[ f(y|\theta) = [y|\theta] = \theta^{y}\times(1-\theta)^{1-y} \]
\(\theta\) = P(Y = 1) = 0.2
\[ f(y=1|\theta) = 0.2^{1}\times(1-0.2)^{1-1} = 0.2 \times 0.8^{0} = 0.2 \]
\[ f(y|\theta) = [y|\theta] = \theta^{y}\times(1-\theta)^{1-y} \]
The support (\(\Omega\)):
Parameter space support (\(\Theta\)):
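A quick check of the Bernoulli PMF and the two axioms in R, assuming \(\theta = 0.2\) as on the earlier slide:

```r
# Bernoulli PMF from the slide: f(y | theta) = theta^y * (1 - theta)^(1 - y)
f <- function(y, theta) theta^y * (1 - theta)^(1 - y)
theta <- 0.2               # P(Y = 1)
f(1, theta)                # 0.2
f(0, theta)                # 0.8
# Axiom 1: each value lies in [0, 1]; Axiom 2: they sum to 1 over {0, 1}
f(0, theta) + f(1, theta)  # 1
```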
What would our data look like for 10 ducks that had a probability of survival (Y=1) of 0.20?
How about to evaluate the sample size of ducks needed to estimate \(\theta\)?
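A sketch of both questions in R, simulating Bernoulli outcomes with rbinom (the seed and the sample sizes tried are illustrative choices):

```r
set.seed(1)  # for reproducibility; seed choice is arbitrary
# Simulate survival (1) or death (0) for 10 ducks with theta = 0.2
y <- rbinom(n = 10, size = 1, prob = 0.2)
y
mean(y)   # sample estimate of theta; varies a lot with only 10 ducks

# Sketch of a sample-size evaluation: spread of the estimate of theta
for (n in c(10, 100, 1000)) {
  est <- replicate(1000, mean(rbinom(n, 1, 0.2)))
  cat("n =", n, " SD of estimates:", round(sd(est), 3), "\n")
}
```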
Use a probability function that makes sense for your data/RV; has the correct support.
In Bayesian inference, we also pick probability functions that make sense for parameters.
Where do you find information on probability functions?
The Normal/Gaussian distribution describes the sample space for all values on the real number line.
\[y \sim \text{Normal}(\mu, \sigma) \\ y \in (-\infty, \infty) \\ y \in \mathbb{R}\]
What is the parameter space for \(\mu\) and \(\sigma\)?
We collect data on adult alligator lengths (in inches).
[1] 90.30 83.02 103.67 85.17 99.20 106.74 90.76 105.28 99.41 101.72
Should we use the Normal Distribution to estimate the mean?
Does the support of our data match the support of the PDF?
Which PDF does?

Are they exactly the same?
The issue is that when the data are near 0, we might estimate nonsensical values (e.g., negative ones).
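We can quantify this concern with pnorm. The fitted values below are hypothetical: one positive-only variable with a mean near zero relative to its SD, and one fit roughly matching the alligator sample:

```r
# Hypothetical Normal fit for a positive-only variable with mean near zero
mu <- 5; sigma <- 3
pnorm(0, mean = mu, sd = sigma)   # P(Y < 0) ~ 0.048: non-negligible
# Rough alligator-length fit (values approximated from the sample shown)
pnorm(0, mean = 96.5, sd = 8.5)   # effectively 0, so the mismatch is harmless
```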

\(y\) can take values in an uncountable set.
Can you provide examples of ecological data that match this support?
PDFs of continuous RVs follow the same rules as PMFs.
Axiom 1:
PDFs output probability densities, not probabilities.
Axiom 2:
\[ y \sim \text{Normal}(\mu, \sigma) \\ f(y|\mu,\sigma ) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{y-\mu}{\sigma})^{2}} \\ \]

The math,
\[ \int_{120}^{\infty} f(y| \mu, \sigma)dy = P(120<Y<\infty) \]
Read this as “the integral of the PDF between 120 and infinity (on the left-hand side) is equal to the probability that the outcome of the random variable is between 120 and infinity (on the right-hand side)”.
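In R, this tail probability is computed with pnorm; the \(\mu\) and \(\sigma\) values below are illustrative, not from the slides:

```r
# P(120 < Y < Inf) for Normal(mu, sigma); parameter values are illustrative
mu <- 100; sigma <- 15
1 - pnorm(120, mean = mu, sd = sigma)                   # upper-tail probability
pnorm(120, mean = mu, sd = sigma, lower.tail = FALSE)   # equivalent
```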
Axiom 3:
The integral of the probability density over the entire support is equal to 1: \[ \int_{-\infty}^{\infty} f(y|\mu,\sigma)\,dy = 1 \]
Properties of all probability functions.
Normal Distribution: the parameters are the 1\(^{st}\) moment (mean, \(\mu\)) and the square root of the 2\(^{nd}\) central moment (SD, \(\sigma\))
Gamma Distribution: parameters are not moments
Shape = \(\alpha\), Rate = \(\beta\)
OR
Shape = \(\kappa\), Scale = \(\theta\), where \(\theta = \frac{1}{\beta}\)
NOTE: probability functions can have Alternative Parameterizations, such that they have different parameters.
Moments are functions of these parameters:
mean = \(\kappa\theta\) or \(\frac{\alpha}{\beta}\)
var = \(\kappa\theta^2\) or \(\frac{\alpha}{\beta^2}\)
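A sketch in R confirming that the two parameterizations describe the same distribution, and that the moments follow from the parameters (the \(\kappa\) and \(\theta\) values are arbitrary):

```r
kappa <- 3; theta <- 2            # shape and scale; values are arbitrary
alpha <- kappa; beta <- 1 / theta # equivalent shape and rate
# Both parameterizations give the same density
dgamma(5, shape = kappa, scale = theta)
dgamma(5, shape = alpha, rate = beta)
# Moments as functions of the parameters
kappa * theta     # mean: 6
kappa * theta^2   # variance: 12
```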
Probability:
\[ y \sim f(y|\boldsymbol{\theta'}) \]
\[ \begin{align*} \boldsymbol{\theta'} &= \begin{matrix} [\kappa & \theta] \end{matrix} \\ f(y|\boldsymbol{\theta}') &= \text{Gamma}(\kappa, \theta) \\ \end{align*} \]
\[ \begin{align*} f(y|\boldsymbol{\theta}') &= \frac{1}{\Gamma(\kappa)\theta^{\kappa}}y^{\kappa-1} e^{-y/\theta} \\ \end{align*} \]

What is the probability we would sample a value >40?
In this population, how common is a value >40?
What is the probability of observing \(y\) < 20?
What is the probability of observing 20 < \(y\) < 40?
Reverse the question: What values of \(y\) and lower have a probability of 0.025?
What values of \(y\) and higher have a probability of 0.025?
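These questions map directly onto R's pgamma and qgamma; the shape and scale values here are assumptions for illustration only:

```r
kappa <- 3; theta <- 10   # shape and scale chosen only for illustration
pgamma(40, shape = kappa, scale = theta, lower.tail = FALSE)   # P(Y > 40)
pgamma(20, shape = kappa, scale = theta)                       # P(Y < 20)
pgamma(40, shape = kappa, scale = theta) -
  pgamma(20, shape = kappa, scale = theta)                     # P(20 < Y < 40)
qgamma(0.025, shape = kappa, scale = theta)   # value with probability 0.025 below
qgamma(0.975, shape = kappa, scale = theta)   # value with probability 0.025 above
```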
\[ f(y|\lambda) = P(Y=y) = \frac{\lambda^ye^{-\lambda}}{y!} \]
The full PMF (for \(\lambda\) = 2):
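A sketch of the PMF in R for \(\lambda = 2\), matching the formula above:

```r
lambda <- 2
y <- 0:10
f <- lambda^y * exp(-lambda) / factorial(y)  # PMF from the formula above
all.equal(f, dpois(y, lambda))               # matches R's built-in dpois
round(f, 3)
sum(dpois(0:100, lambda))                    # ~1: Axiom 2 on support 0, 1, 2, ...
```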

Statistics: Interested in estimating population-level characteristics; i.e., the parameters
\[ y \rightarrow f(y|\boldsymbol{\theta}) \]
REMEMBER
\(f(y|\boldsymbol{\theta})\) is a probability statement about \(y\), NOT \(\boldsymbol{\theta}\).
All the evidence/information in a sample (\(\textbf{y}\), i.e., data) relevant to making inference on model parameters (\(\theta\)) is contained in the likelihood function.
The sample data, \(\textbf{y}\)
A probability function for \(\textbf{y}\):
\(f(\textbf{y};\theta)\) or \([\textbf{y}|\theta]\) or \(P(\textbf{y}|\theta)\)
the unknown parameter(s) (\(\theta\)) of the probability function
\[ \mathcal{L}(\boldsymbol{\theta}|y) = P(y|\boldsymbol{\theta}) = f(y|\boldsymbol{\theta}) \]
The likelihood (\(\mathcal{L}\)) of the unknown parameters, given our data, can be calculated using our probability function.
For example, for \(y_{1} \sim \text{Normal}(\mu,\sigma = 1)\)
CODE:
[1] 0.05399097
If we knew the mean were truly 8, this would also be the probability density of the observation y = 10. But we don’t know what the mean truly is.
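The CODE chunk above was lost in conversion; a reconstruction that reproduces the output shown, assuming it evaluated the Normal density at y = 10 for the guess \(\mu = 8\):

```r
# Likelihood of the guess mu = 8 given the single observation y = 10
dnorm(x = 10, mean = 8, sd = 1)
# [1] 0.05399097
```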
For example, for \(y_{1} \sim \text{Normal}(\mu,\sigma = 1)\)
The key is to understand that the likelihood values are relative, which means we need many guesses.
CODE:
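A sketch of the “many guesses” computation in R: evaluate the likelihood over a grid of candidate \(\mu\) values for the single observation y = 10, with \(\sigma\) fixed at 1 (the grid range is a choice, not from the slides):

```r
y <- 10
mu_guess <- seq(4, 14, by = 0.1)          # many candidate values for mu
lik <- dnorm(y, mean = mu_guess, sd = 1)  # likelihood of each guess
mu_guess[which.max(lik)]                  # the best-supported guess is y itself
plot(mu_guess, lik, type = "l", xlab = expression(mu), ylab = "Likelihood")
```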

Central Tenet: evidence is relative.
Parameters are not RVs. They are not defined by a PDF/PMF.
MLEs are consistent. As sample size increases, they will converge to the true parameter value.
MLEs are asymptotically unbiased. The \(E[\hat{\theta}]\) converges to \(\theta\) as the sample size gets larger.
No guarantee that MLE is unbiased at small sample size.
MLEs are asymptotically efficient: as the sample size gets larger, they achieve the minimum variance among all consistent estimators.
What is the mean height of King Penguins?

We go and collect data,
\(\boldsymbol{y} = \begin{matrix} [4.34 & 3.53 & 3.75] \end{matrix}\)
Let’s decide to use the Normal Distribution as our PDF.
\[ \begin{align*} f(y_1 = 4.34|\mu,\sigma) &= \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{y_{1}-\mu}{\sigma})^2} \\ \end{align*} \]
AND
\[ \begin{align*} f(y_2 = 3.53|\mu,\sigma) &= \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{y_{2}-\mu}{\sigma})^2} \\ \end{align*} \]
AND
\[ \begin{align*} f(y_3 = 3.75|\mu,\sigma) &= \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{y_{3}-\mu}{\sigma})^2} \\ \end{align*} \]
Or simply,
\[ \textbf{y} \stackrel{iid}{\sim} \text{Normal}(\mu, \sigma) \]
\(iid\) = independent and identically distributed
The joint probability of our data with shared parameters \(\mu\) and \(\sigma\),
\[ \begin{align*} & P(Y_{1} = y_1,Y_{2} = y_2, Y_{3} = y_3 | \mu, \sigma) \\ \end{align*} \]
\[ P(Y_{1} = 4.34,Y_{2} = 3.53, Y_{3} = 3.75 | \mu, \sigma) \]
IF each \(y_{i}\) is independent, the likelihood of our parameters is simply the product of all three probability densities,
\[ \begin{align*} &= f(4.34|\mu, \sigma)\times f(3.53|\mu, \sigma)\times f(3.75|\mu, \sigma) \end{align*} \] \[ \begin{align*} &= \prod_{i=1}^{3} f(y_{i}|\mu, \sigma) \\ &= \mathcal{L}(\mu, \sigma|y_{1},y_{2},y_{3}) \end{align*} \]
We can do this because we assume that knowing one observation gives us no new information about any other observation.
\(P(y_{2}|y_{1}) = P(y_{2})\)
Translate the math to code…
Calculate likelihood of many guesses of \(\mu\) and \(\sigma\) simultaneously,
What happens to the likelihood if we increase the sample size to N=100?
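Translating the penguin example into R: a grid search over simultaneous guesses of \(\mu\) and \(\sigma\), summing log densities (the log of the likelihood product) for numerical stability. The grid ranges and step sizes are illustrative choices, not from the slides:

```r
y <- c(4.34, 3.53, 3.75)
# Grid of simultaneous guesses (ranges/steps are illustrative)
grid <- expand.grid(mu    = seq(3, 5, by = 0.01),
                    sigma = seq(0.1, 2, by = 0.01))
# iid assumption: joint density = product over i, so log-likelihood = sum of logs
grid$loglik <- apply(grid, 1, function(g)
  sum(dnorm(y, mean = g["mu"], sd = g["sigma"], log = TRUE)))
grid[which.max(grid$loglik), ]  # MLE: mu near mean(y); sigma near sqrt(mean((y - mean(y))^2))
# With N = 100 the surface becomes far more sharply peaked: guesses away from
# the MLE lose support much faster, while individual likelihood values
# themselves become tiny (one reason to work on the log scale).
```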